Abstract

Video classification has advanced tremendously over recent years.
A large part of the improvement in video classification has come from
the work done by the image classification community and the use of
deep convolutional networks (CNNs), which produce results competitive
with hand-crafted motion features. These networks were adapted to
use video frames in various ways and have yielded state-of-the-art
classification results. We present two methods that build on this
work and scale it up to work with millions of videos and hundreds of
thousands of classes while maintaining a low computational cost. In
the context of large scale video processing, training CNNs on video
frames is extremely time consuming, due to the large number of
frames involved. We propose to avoid this problem by training CNNs
on either YouTube thumbnails or Flickr images, and then using these
networks’ outputs as features for other higher level classifiers. We
discuss the challenges of achieving this and propose two models for
frame-level and video-level classification. The first is a highly
efficient mixture of experts, while the second is based on long
short-term memory neural networks. We present results on the
Sports-1M video dataset (1 million videos, 487 classes) and on a new
dataset which has 12 million videos and 150,000 labels.

1. Introduction 

Video classification is the task of producing a label that is
relevant to the video given its frames. A good video-level classifier
is one that not only provides accurate frame labels, but also best
describes the entire video given the features and the annotations of
the various frames in the video. For example, a video might contain a
tree in some frame, but the label that is central to the video might
be something else (e.g., “hiking”). The granularity of the labels
that are needed to describe the frames and the video depends on the
task. Typical tasks include assigning one or more global labels to
the video, and assigning one or more labels for each frame inside the
video. In this paper we deal with a truly large scale dataset of
videos that best represents videos in the wild. Much of the
advancement in object recognition and scene understanding comes from
convolutional neural networks [5, 14, 16, 28]. The key factors that
enabled such large scale success with neural networks were
improvements in distributed training, advancements in optimization
techniques and architectural improvements [29]. While the best
published results [19] on academic benchmarks such as UCF-101 use
motion features such as IDTF [32], we will not make use of them in
this work due to their high computational cost.

Figure 1: Overview of the MiCRObE training pipeline.

Training neural networks on video is a very challenging task due to
the large amount of data involved. Typical approaches take an
image-based network and train it on all the frames from all videos in
the training dataset. We created a benchmark dataset on which this
would simply be infeasible. In our dataset we have 12 million videos.
Assuming a sampling rate of 1 frame per second, this would yield
2.88 billion frames. Training an image-based network on such a large
number of images would simply take too long with current generation
hardware. Another challenge which we aim to address is how to handle
a very large number of labels. In our dataset we have 150,000 labels.

We approach the problem of training on such a video corpus using two
key ideas: 1) we use CNNs that were trained using video thumbnails or
Flickr images as base features; and 2) the scale is large enough that
only distributed algorithms may be used. Assuming a base image-based
CNN classifier of 150,000 classes, and that on average 100 of these
classes trigger per frame, an average video in our dataset would be
represented using 24,000 features. In a naive linear classifier this
may require up to 3.6 billion weight updates per video. Assuming a
single pass over the data, in the worst case it would generate
$43 \times 10^{15}$ updates.
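The scale estimates quoted in this section can be re-derived with a few lines of arithmetic. The sketch below only restates figures given in the text (12 million videos, 1 frame per second, 150,000 classes, roughly 100 classes firing per frame) plus the 240-second average video length quoted later in the paper; the variable names are ours.

```python
# Back-of-the-envelope scale estimate from the figures quoted in the text.
NUM_VIDEOS = 12_000_000
AVG_VIDEO_SECONDS = 240          # average video length quoted later in the paper
FRAMES_PER_SECOND = 1            # sampling rate assumed in the text
NUM_CLASSES = 150_000
CLASSES_FIRING_PER_FRAME = 100   # average number of base classes that trigger

frames_per_video = AVG_VIDEO_SECONDS * FRAMES_PER_SECOND
total_frames = NUM_VIDEOS * frames_per_video

# Sparse features describing one average video.
features_per_video = frames_per_video * CLASSES_FIRING_PER_FRAME

# Worst case for a naive linear classifier: every feature touches every class.
updates_per_video = features_per_video * NUM_CLASSES
updates_per_pass = updates_per_video * NUM_VIDEOS

print(f"{total_frames:.3g} frames in total")
print(f"{features_per_video} features per video")
print(f"{updates_per_video:.3g} weight updates per video")
print(f"{updates_per_pass:.3g} weight updates for one pass over the data")
```

The last figure is about $4.3 \times 10^{16}$, which is why neither frame-level CNN training nor a naive dense linear model is practical at this scale.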

The main contribution of this paper is describing two orthogonal
methods which can be used to learn efficiently on such a dataset. The
first method consists of a cascade of steps. We propose to use an
initial relatively weak classifier to quickly learn feature-class
mappings while pruning as many of these correlations as possible.
This classifier is then used for hard negative mining for a second
order classifier which is then improved iteratively. The second
method employs an optimized neural network architecture using long
short-term memory (LSTM) neurons [12] and hierarchical softmax [21],
while using a distributed training architecture [8].

We present two methods for both frame-level and video-level
classification. The first, named MiCRObE (Max Calibration mixtuRe Of
Experts, see Figure 1), is described in Section 3, while the second
method, which we abbreviate as LSTM, is described in Section 4.

2. Related Work 

Our work is targeted at large scale video classification. In terms
of public benchmarks, the largest publicly available benchmark is
Sports-1M [16], which contains over one million videos and 487
labels. The best performing classification method on the Sports-1M
benchmark has been a frame-level convolutional deep neural network
with either max-pooling or LSTM on top [22]. Using the same
benchmark, Karpathy et al. [16] and Tran et al. [31] propose using a
convolutional deep network for making frame and video-level
predictions, while Karpathy et al. [16] also present results on using
hand-crafted features and deep networks. The inputs to these networks
are raw pixels, and the networks are trained through the
convolutional part, resulting in a very long training time. Other
large scale video classification methods [3, 30, 33] used
hand-crafted features and per-class AdaBoost classifiers, but were
only able to use a fraction of the videos to train the per-class
models. Unlike previous work, our goal is to provide fast training
times with models capable of frame-level and video-level prediction,
while allowing for a much larger number of labels and videos to be
used for training.

Many of the best performing models in machine learning problems such
as image classification, pattern recognition and machine translation
come from fusion of multiple classifiers [24, 25]. Given scores
$p_l^j$ for $j = 1, \ldots, M$ from each of the $M$ sources for a
label $l$, the traditional fusion problem is a function
$p_l = f(p_l^1, \ldots, p_l^M)$ that maps these probabilities to a
single probability value. This problem is well studied and one of the
most popular techniques is Bayes fusion, which has been successfully
applied to vision problems [17, 26]. Voting based fusion techniques
like majority and sum based fusion are extremely popular, mostly
because they are simple and non-parametric. The current best result
on ImageNet is based on a simple ensemble average of six different
classifiers that output ImageNet labels [13].

The fundamental assumption in these settings is that each of the $M$
sources needs to speak the same vocabulary as that of the target.
What if the underlying sources do not speak the same vocabulary, yet
output semantically meaningful units? For example, suppose the
underlying classifier only detected canyon, river and rafting. Can we
learn to infer the probability of the target label being Grand Canyon
from these detections? Another extreme is to have the underlying
classifier so fine-grained that it has (for example) the label
African elephant, but does not have the label elephant. If the label
elephant is present in the target vocabulary, can we learn to infer
the relation African elephant $\Rightarrow$ elephant organically from
the data? One approach is to treat the underlying classification
outputs as features and train the classifiers for each label based on
these features. This idea has been proposed in the context of scene
classification [20]. This approach can quickly run into the curse of
dimensionality, especially if the underlying feature space is huge
(which is indeed the case for our problem).

3. MiCRObE 

3.1. Feature Calibration 

The process of calibration is to learn a one dimensional model that
computes the probability of each label $e$ given a single feature
$f$. Here, the sparse features can be semantic units that may or may
not speak in the same vocabulary as the target labels. In addition to
providing a simple max-calibration based label classifier, the
calibration process also helps in feature selection, which can be
used to significantly speed up the training of classifiers like SVM
or logistic regression. The feature selection process yields:
(a) Automatic synonym expansion using visual similarity: As an
example, we allow the sparse feature named Canyon from one of our
base models to predict an entity Grand Canyon which is not in the set
of input sparse features. Similarly Clock Tower will be able to
predict Big Ben. (b) Automatic expansion to related terms based on
visual co-occurrence: For example, we will get the feature water for
the label boat, which can be used as supporting evidence for the boat
classifier.

In other words, “Canyon”, “Clock Tower”, “cooking”, “water” are
features but “Grand Canyon”, “boat” and “Big Ben” are labels.
Formally put, the input is a sparse 150,000 dimensional feature
vector which is a combination of predictions from various
classifiers. The output is a set of target labels. The calibration
model is a function $p_{e|f}(x)$, defined over pairs of label ($e$)
and feature ($f$), that is learned according to an isotonic
regression. We use a modified version of Platt’s scaling [23] to
model this probability:

$$p_{e|f}(x) = \alpha\left(\sigma(\beta x + \gamma) - \sigma(\gamma)\right) \tag{1}$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function and
$\alpha, \beta, \gamma$ are functions of $e$ and $f$. We enforce
$\alpha, \beta > 0$ so that the function $p_{e|f}(x)$ monotonically
increases with $x$ (the feature value). Furthermore, since
$p_{e|f}(x)$ is a probability, we need to enforce that
$p_{e|f}(x_{\max}) < 1$, where $x_{\max}$ is the maximum feature
value from the training data. The scale $\alpha$ allows the estimated
probability to plateau to a value less than 1.0 (a property that
cannot be enforced in normal Platt’s scaling). For example, one of
the input sparse features is the detection “Canyon” from a base image
classifier. There are at least a dozen canyons in the entire world
(including Grand Canyon). It is reasonable for the probability of
Grand Canyon to have a value less than 1.0 even if the input sparse
feature “Canyon” fired with the highest confidence from an extremely
precise base classifier. Furthermore, the offset term enforces that
$p_{e|f}(0) = 0$, which helps when dealing with sparse features.
Thus, we only capture positively correlating features for a label
$e$. Fitting of $p_{e|f}(x)$ can be done by minimizing either the
squared error or the log-loss over all instances (video frames in our
case) where $x_f > 0$. We used squared loss in our implementation, as
we found it to be more robust near the boundaries, especially given
that $p_{e|f}(0)$ is enforced to be zero. For each instance where
$x_f > 0$ we also have a ground-truth value associated with label $e$
as $g_e$. Given training examples $(w_t, x_t, g_t)_{t \in T}$, where
$w_t$ is the weight of the example*, $x_t$ is the feature value and
$g_t$ is the ground-truth, we estimate $\alpha, \beta, \gamma$ by
solving the following regularized least squares problem:

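$$(\hat{\alpha}, \hat{\beta}, \hat{\gamma}) = \arg\min_{\alpha, \beta, \gamma} \sum_{t \in T} w_t \left(p_{e|f}(x_t) - g_t\right)^2 + \lambda\left(\alpha^2 + \beta^2 + \gamma^2\right) \tag{2}$$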

subject to $\alpha > 0$ and $\beta > 0$. $\lambda$ is tuned on a held
out set to minimize the held out squared loss. We estimate 9 billion
triples of $(\alpha, \beta, \gamma)$ and only retain the ones where
the estimated $\alpha > 0$. Since the problem has only 3 variables,
we can compute the exact derivative and Hessian w.r.t.
$\alpha, \beta, \gamma$ at each point and do a Newton update. The
various $(e, f)$ pairs are processed in parallel. Once the function
$p_{e|f}(x)$ is learned, we choose up to the top $K$ features sorted
according to $p_{e|f}(x_{\max})$ (the maximum probability of the
label given the feature). The outcome is a set of positively
correlated features for each label $e$.

*To speed up the implementation, we quantize the feature values in
buckets of size 10“^; the weight $w$ is the total number of examples
that fell in a bucket and $g$ is the mean ground-truth value in that
bucket.
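To make the calibration function and its fitting objective concrete, the following is a minimal sketch of equations (1) and (2) in plain Python. It is not the authors' implementation: the function names are ours, and the optimizer (a Newton update in the paper) is left out, so only the curve and the regularized weighted squared loss are shown.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibrated_prob(x, alpha, beta, gamma):
    """Modified Platt scaling: alpha * (sigmoid(beta*x + gamma) - sigmoid(gamma)).

    The offset makes the value exactly zero at x = 0, and alpha caps the
    plateau at alpha * (1 - sigmoid(gamma)), which can be below 1.
    """
    return alpha * (sigmoid(beta * x + gamma) - sigmoid(gamma))


def regularized_loss(params, examples, lam):
    """Weighted squared error plus an L2 penalty on (alpha, beta, gamma).

    examples: iterable of (weight, feature_value, ground_truth) triples,
    e.g. one triple per quantization bucket.
    """
    alpha, beta, gamma = params
    data_term = sum(
        w * (calibrated_prob(x, alpha, beta, gamma) - g) ** 2
        for w, x, g in examples
    )
    return data_term + lam * (alpha ** 2 + beta ** 2 + gamma ** 2)
```

Any bound-constrained optimizer that keeps `alpha` and `beta` positive could be used to minimize `regularized_loss` for one (label, feature) pair; since each pair is an independent three-variable problem, the pairs can be processed in parallel.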

3.2. Max Calibration Model 

Once the calibrations $p_{e|f}(x)$ are learned for (label, feature)
pairs, the max-calibration model is an optimistic predictor of the
probability of each entity $e$ given the set of all features that
fired in the frame $\mathbf{x}$:

$$p_e(\mathbf{x}) = \max_f \, p_{e|f}(x_f) \tag{3}$$

Note that the max-calibration model works best when the input
features are sparse outputs that have some semantic meaning. Despite
the simplicity and robustness of the max-calibration model, there are
several drawbacks that may limit it from yielding the best
performance:

(a) The max-calibration model uses noisy ground truth data (it
assumes all frames in the video are associated with the label). At
the very least, we need to correct this by learning another model
that uses a cleaner ground truth.

(b) Furthermore, the non-linear operation of taking a max over all
the probabilities may result in overly optimistic predictions for
labels which have a lot of features $F_e$. Hence the output will no
longer be well calibrated (unless we learn another calibrator on top
of the max-calibrated score).

(c) Each feature is treated independently, hence the max-calibration
model cannot capture the correlations between the various features.
Moreover, the max-calibration model can only deal with sparse
features. For example, we cannot use continuous valued features like
the output of an intermediate layer from a deep network.

As a result, we will use the max-calibration model as 
a bootstrapping mechanism for training a second order 
model. 
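As an illustration of equation (3), the sketch below scores one frame with a max-calibration model. The dictionary layout, the function names and the example parameter values are assumptions made for this sketch; only the scoring rule (take the maximum calibrated probability over the shortlisted features that fired) comes from the text.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibrated_prob(x, alpha, beta, gamma):
    """Calibration curve of equation (1)."""
    return alpha * (sigmoid(beta * x + gamma) - sigmoid(gamma))


def max_calibration(frame_features, calibrators):
    """Score every label for one frame.

    frame_features: dict feature -> value, containing only features that fired.
    calibrators: dict label -> dict feature -> (alpha, beta, gamma), i.e. the
        shortlisted features retained for each label after calibration.
    Returns a dict label -> score; labels with no fired feature are omitted.
    """
    scores = {}
    for label, per_feature in calibrators.items():
        best = 0.0
        for feature, (alpha, beta, gamma) in per_feature.items():
            x = frame_features.get(feature, 0.0)
            if x > 0:
                best = max(best, calibrated_prob(x, alpha, beta, gamma))
        if best > 0:
            scores[label] = best
    return scores


# Hypothetical parameters, for illustration only.
example_calibrators = {
    "Grand Canyon": {"Canyon": (0.3, 5.0, -2.0), "river": (0.1, 4.0, -1.0)},
    "boat": {"water": (0.6, 3.0, -1.0)},
}
example_scores = max_calibration({"Canyon": 0.9, "water": 0.5}, example_calibrators)
```

Because each label only looks at its own shortlist, the cost per frame grows with the number of fired features rather than with the full 150,000 dimensional input.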

Hard Negative Mining: The max-calibration model provides calibrated
probabilities for all labels in our vocabulary. It is a simple model
and is extremely efficient to compute. Hence, we will exploit this
property to mine good positives and hard negatives (i.e., the ones
that score high according to the max-calibration model). The mining
process for an entity $e$ can be described formally as sorting (from
highest to lowest) all the training examples (video frames in our
case) according to the max-calibration score of $e$ and retaining the
top $M$ examples. We chose $M$ such that it captures more than 95% of
the positives. Since the number of training examples is huge (e.g.,
3.6 billion frames, in our case), we do this approximately using
map-reduce where, in each of the $W$ workers, we collect the top $k$
examples and choose the top $M$ examples from the resulting $kW$
examples. Although this approach is approximate, if $kW$ is chosen to
be sufficiently larger than $M$, we can guarantee that we recover
almost all of the top $M$ examples. The expected number of the true
top $M$ examples that will be recovered by choosing the top $M$
examples from this $kW$ sized set is given as

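$$E(k, W, M) = W\left[k + \sum_{n=0}^{k} (n - k)\binom{M}{n}\left(\frac{1}{W}\right)^{n}\left(1 - \frac{1}{W}\right)^{M-n}\right] \tag{4}$$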

For example, if $M = 80000$ examples, $W = 4000$ workers and $k = 40$
examples per worker, this evaluates to 79999.8126. In general,
setting $kW = 2M$ yields a good guarantee. In the next section, we
show how to get the top $k$ examples from each worker efficiently.
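The expected recovery can be checked numerically under one simple model, which is our assumption rather than something stated explicitly in the text: the true top $M$ examples are spread uniformly at random over the $W$ workers, so the number landing on one worker is Binomial($M$, $1/W$), and a worker that receives $n$ of them contributes $\min(k, n)$ to the final set. The sketch below evaluates $W \cdot \mathbb{E}[\min(k, n)]$ in log space to avoid overflow; all names are ours.

```python
import math


def binomial_log_pmf(n, trials, p):
    """log P(X = n) for X ~ Binomial(trials, p), with 0 < p < 1."""
    return (
        math.lgamma(trials + 1)
        - math.lgamma(n + 1)
        - math.lgamma(trials - n + 1)
        + n * math.log(p)
        + (trials - n) * math.log1p(-p)
    )


def expected_recovered(k, num_workers, m):
    """Expected number of the true top-m examples that survive per-worker top-k.

    Assumes (our assumption) that the top-m examples are assigned to workers
    uniformly at random; requires num_workers > 1.
    Uses E[min(k, X)] = k + sum_{n <= k} (n - k) * P(X = n).
    """
    p = 1.0 / num_workers
    shortfall = sum(
        (n - k) * math.exp(binomial_log_pmf(n, m, p))
        for n in range(0, min(k, m) + 1)
    )
    return num_workers * (k + shortfall)


recovered = expected_recovered(k=40, num_workers=4000, m=80000)
print(recovered)
```

With $kW = 2M$ each worker expects only $k/2$ of the top examples, so the probability that any single worker overflows its queue is tiny, which is why the result lands within a fraction of one example of $M$.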

3.3. Choosing Top-k Examples per Worker

The brute force approach to achieve this is to compute the
max-calibration score using (3) for each label $e$ given the features
$\mathbf{x}$ for all examples that belong to the worker $w$ and
insert $\{p_e(\mathbf{x}), e, \mathbf{x}\}$ into a $k$-sized priority
queue (which is keyed by the max-calibrated probability
$p_e(\mathbf{x})$) for the label $e$. Unfortunately, this can be very
time consuming, especially when assigning millions of examples per
worker. In this section, we propose an approach that makes this
mining extremely efficient and is particularly tuned towards the
max-calibration model. The idea is to only score labels which are
guaranteed to enter the priority queue. As a result of this,
computing $p_{e|f}(x_f)$ becomes less and less frequent as more
examples are processed in the worker and the priority queue for $e$
continues to get updated.

From the calibration step, we have a set of shortlisted features for
each label $e$. If we invert this list, we get a set of shortlisted
labels $E_f$ for each feature $f$. In each worker $w$, we also
maintain a priority queue $Q(e, w)$ for each label $e$ that stores up
to the top-$k$ examples (according to the max-calibration score).

In each worker $w$, for each feature $f$, we store an inverse lookup
to labels $E_f(w)$ which is initially $E_f$. In addition, we also
store a minimum feature firing threshold $\tau_{f,e}$ such that only
if $x_f \ge \tau_{f,e}$ for some $f$ will we insert $e$ into the
priority queue. Initially $\tau_{f,e} = 0$, which implies that every
label $e \in E_f$ for all $f$ such that $x_f > 0$ will be scored.

Let the minimum max-calibration value in the priority queue be
$Q_{\min}(e, w)$. This is zero if the size of the priority queue is
less than $k$; otherwise (when the size is equal to $k$) it is the
smallest element in the queue.

For each training example, let x be the sparse feature 
vector and g be the corresponding ground-truth. 

Let $p_e(\mathbf{x})$ be the score of $e$ according to the
max-calibration model for this instance. Instead of computing
$p_e(\mathbf{x})$ explicitly for all labels using the max-calibration
model, we only compute it for a subset of the labels which are
guaranteed to enter the priority queue $Q(e, w)$ as follows:
$p_e(\mathbf{x})$ is initially zero for all $e$ (an empty map). For
each feature $f$ with $x_f > 0$ and for each label $e \in E_f(w)$, if
$x_f \ge \tau_{f,e}$ and $p_{e|f}(x_f) < Q_{\min}(e)$, we update
$\tau_{f,e}$ to $x_f$. In addition, if
$p_{e|f}(x_{\max}) < Q_{\min}(e)$, we remove $e$ from $E_f(w)$ (so we
have fewer labels in the inverse lookup for $f$). On the other hand,
if $x_f \ge \tau_{f,e}$ and $p_{e|f}(x_f) \ge Q_{\min}(e)$, we update
$p_e(\mathbf{x})$ to $\max(p_e(\mathbf{x}), p_{e|f}(x_f))$. For each
$e$ where $p_e(\mathbf{x}) > 0$, insert
$\{p_e(\mathbf{x}), e, \mathbf{x}, g\}$ into the priority queue
$Q(e, w)$.
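The per-worker mining procedure can be sketched with a min-heap per label and a per-(feature, label) firing threshold. This is our reading of the description, not the authors' code: the class and attribute names are hypothetical, and the step that removes labels from the inverse lookup is omitted for brevity. The sketch relies on two facts from the text: the calibration curve is monotonically increasing, and the smallest value in a full queue never decreases.

```python
import heapq
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibrated_prob(x, alpha, beta, gamma):
    """Calibration curve of equation (1)."""
    return alpha * (sigmoid(beta * x + gamma) - sigmoid(gamma))


class WorkerMiner:
    """Keeps the top-k examples per label on one worker (illustrative sketch)."""

    def __init__(self, k, calibrators):
        # calibrators: dict (label, feature) -> (alpha, beta, gamma)
        self.k = k
        self.calibrators = calibrators
        self.labels_for_feature = {}  # inverse lookup: feature -> set of labels
        for label, feature in calibrators:
            self.labels_for_feature.setdefault(feature, set()).add(label)
        self.threshold = {}  # (feature, label) -> minimum useful feature value
        self.queues = {}     # label -> min-heap of (score, example_id)

    def q_min(self, label):
        """Smallest score in a full queue, or zero while the queue has room."""
        heap = self.queues.get(label, [])
        return heap[0][0] if len(heap) >= self.k else 0.0

    def add_example(self, example_id, features):
        scores = {}
        for feature, x in features.items():
            if x <= 0:
                continue
            for label in self.labels_for_feature.get(feature, ()):
                key = (feature, label)
                if x < self.threshold.get(key, 0.0):
                    continue  # already known to be too weak for this queue
                p = calibrated_prob(x, *self.calibrators[(label, feature)])
                if p < self.q_min(label):
                    # Monotonicity: any smaller value of this feature fails too.
                    self.threshold[key] = x
                else:
                    scores[label] = max(scores.get(label, 0.0), p)
        for label, score in scores.items():
            if score <= 0:
                continue
            heap = self.queues.setdefault(label, [])
            if len(heap) < self.k:
                heapq.heappush(heap, (score, example_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, example_id))
```

As a queue fills up with strong examples, its thresholds rise and most (feature, label) pairs are rejected by a single comparison, without evaluating the calibration curve at all.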

These $M$ examples become the training data for a second order
classifier. Since the second order model is trained only on these $M$
examples, it is important to retain this distribution at inference
time. The second order model may not do well when an odd-man-out data
point (one that is not in the distribution of these $M$ examples) is
seen. Hence at inference time, we put a threshold on the lowest
max-calibration score of any positive example seen in the $M$
training examples.

Popular Labels: For popular YouTube labels like Minecraft, $M$ needs
to be sufficiently high to capture a significant fraction of the
positives. For example, Minecraft occurs in 3% of YouTube videos in
our set. On a dataset of 10 million videos with frames sampled at
1 fps and each video having an average length of 4 minutes, we have
72 million frames that correspond to Minecraft. In this case $M$
needs to be much higher than 72 million, and this is not feasible to
fit in a single machine.† When considering each example for such
labels to be added to the top-$k$ list in each worker, we do a random
sampling with a probability $p$ which is proportional to … [i.e., the
step “insert $\{p_e(\mathbf{x}), e, \mathbf{x}\}$ to the priority
queue” is done with this probability].

3.4. Training the Second Order Model 

Given the top $M$ examples of positives and negatives obtained from
the hard negative mining using the first order max-calibration model,
the second order model learns to discriminate the good positives and
hard negatives in this set. At inference time, for each example
$\mathbf{x}$, we check if the max-calibration score is at least as
high as the lowest max-calibration score of any positive example in
this training set. Note that checking if the max-calibration score is
at least $\tau_e$ is equivalent to checking if at least one of the
feature values passes a certain threshold. Formally put,

$$\max_f \, p_{e|f}(x_f) \ge \tau_e \;\equiv\; \bigvee_f \left(x_f \ge p_{e|f}^{-1}(\tau_e)\right) \tag{5}$$

$p_{e|f}(x_f)$ is monotonically increasing and hence its inverse is
uniquely defined. At inference time, we check if at least one feature
exceeds its threshold $p_{e|f}^{-1}(\tau_e)$, which is pre-computed
during initialization. If the max-calibration score exceeds $\tau_e$,
we apply the second order model to compute the final score of the
label $e$ for the example $\mathbf{x}$. If the max-calibration score
does not exceed this threshold, the final score $p_e(\mathbf{x})$ of
the label $e$ is set to zero. This is essentially a 2-stage cascade
model, where a cheap max-calibration model is used as an initial
filter, followed by a more accurate and more expensive second order
model. We used logistic regression and mixture of experts as
candidates for this second order model.

†The second order classifier is trained in parallel across different
workers, but we train each label on a single machine.
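The two-stage cascade can be sketched as follows. The closed-form inverse is derived by us from the calibration curve $\alpha(\sigma(\beta x + \gamma) - \sigma(\gamma))$; the function names, the dictionary of per-feature thresholds and the callable standing in for the second order model are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibrated_prob(x, alpha, beta, gamma):
    """Calibration curve of equation (1)."""
    return alpha * (sigmoid(beta * x + gamma) - sigmoid(gamma))


def inverse_calibration(tau, alpha, beta, gamma):
    """Feature value x at which the calibration curve equals tau (tau > 0).

    Returns None when tau lies above the plateau of the curve, i.e. this
    feature alone can never pass the filter.
    """
    target = tau / alpha + sigmoid(gamma)  # required value of sigmoid(beta*x + gamma)
    if target >= 1.0:
        return None
    logit = math.log(target / (1.0 - target))
    return (logit - gamma) / beta


def cascade_score(features, feature_thresholds, second_order_model):
    """Run the cheap filter first; call the expensive model only if it passes.

    feature_thresholds: dict feature -> precomputed inverse_calibration value
        for one label (None entries are treated as never passing).
    second_order_model: callable mapping the feature dict to a score.
    """
    passes = any(
        thr is not None and features.get(f, 0.0) >= thr
        for f, thr in feature_thresholds.items()
    )
    return second_order_model(features) if passes else 0.0
```

Precomputing one threshold per (label, feature) pair turns the filter into a handful of comparisons per frame, so the second order model only runs on examples that resemble its training distribution.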

3.5. Mixture of Experts 

Recall that we train a binary classifier for each label $e$. $y = 1$
denotes the existence of $e$ in the features $\mathbf{x}$. Mixture of
experts (MoE) was first proposed by Jacobs and Jordan [15]. In an MoE
model, each individual component models a different binary
probability distribution. The probability according to the mixture of
$H$ experts is given as

$$p(y = 1 \mid \mathbf{x}) = \sum_h p(h \mid \mathbf{x}) \, p(y = 1 \mid \mathbf{x}, h) \tag{6}$$

where the conditional probability of the hidden state given the
features is a soft-max over $H + 1$ states,
$p(h \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_h^T \mathbf{x})}{1 + \sum_{h'} \exp(\mathbf{w}_{h'}^T \mathbf{x})}$,
where the last state is a dummy state that always results in the
non-existence of the entity. When $H = 1$, it is a product of two
logistics and hence is more general than a single logistic
regression. The conditional probability of $y = 1$ given the hidden
state and the features is a logistic regression given as
$p(y = 1 \mid \mathbf{x}, h) = \sigma(\mathbf{u}_h^T \mathbf{x})$.
The parameters to be estimated are the softmax gating weights
$\mathbf{w}_h$ for each hidden state and the expert logistic weights
$\mathbf{u}_h$. For the sake of brevity, we will denote
$p_{y|\mathbf{x}} = p(y = 1 \mid \mathbf{x})$,
$p_{h|\mathbf{x}} = p(h \mid \mathbf{x})$ and
$p_{y|h,\mathbf{x}} = p(y = 1 \mid \mathbf{x}, h)$.
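Given a set of training data $(w_i, \mathbf{x}_i, g_i)_{i = 1 \ldots N}$
for each label, where $\mathbf{x}_i$ is the feature vector and $g_i$
is the corresponding boolean ground-truth, we optimize the
regularized loss of the data, which is given by

$$\sum_{i=1}^{N} w_i \, \mathcal{L}\left[p_{y|\mathbf{x}_i}, g_i\right] + \lambda_2\left(\|\mathbf{w}\|_2^2 + \|\mathbf{u}\|_2^2\right) \tag{7}$$

where the loss function $\mathcal{L}(p, g)$ is the log-loss given as

$$\mathcal{L}(p, g) = -g \log p - (1 - g) \log(1 - p) \tag{8}$$

and $w_i$ is the weight for the example.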

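Optimization: Note that we could directly write the derivative of
$\mathcal{L}[p_{y|\mathbf{x}}, g]$ with respect to the softmax weight
$\mathbf{w}_h$ and the logistic weight $\mathbf{u}_h$ as

$$\frac{\partial \mathcal{L}[p_{y|\mathbf{x}}, g]}{\partial \mathbf{w}_h} = \frac{p_{h|\mathbf{x}} \left(p_{y|h,\mathbf{x}} - p_{y|\mathbf{x}}\right) \left(p_{y|\mathbf{x}} - g\right)}{p_{y|\mathbf{x}} \left(1 - p_{y|\mathbf{x}}\right)} \, \mathbf{x}$$

$$\frac{\partial \mathcal{L}[p_{y|\mathbf{x}}, g]}{\partial \mathbf{u}_h} = \frac{p_{h|\mathbf{x}} \, p_{y|h,\mathbf{x}} \left(1 - p_{y|h,\mathbf{x}}\right) \left(p_{y|\mathbf{x}} - g\right)}{p_{y|\mathbf{x}} \left(1 - p_{y|\mathbf{x}}\right)} \, \mathbf{x}$$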

Our implementation uses the Ceres library [2] to solve the
minimization in (7) to obtain the weights
$(\mathbf{w}_h, \mathbf{u}_h)$ using the Broyden Fletcher Goldfarb
Shanno algorithm (LBFGS). We also implemented an EM variant where the
collected statistics are used to re-estimate the softmax and the
logistic weights (both of which are convex problems). However, in
practice, we found that LBFGS converges much faster than EM and also
produces a better objective in most cases. For all our experiments,
we report accuracy numbers using the LBFGS optimization.
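The mixture-of-experts model and its per-example gradients can be sketched in plain Python as below. This is an illustration of the gating-plus-dummy-state formulation described above, not the authors' Ceres-based implementation; the function names are ours, weights are stored as lists of lists, and at least one expert is assumed.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def moe_forward(x, gate_w, expert_u):
    """Return (p_y, gate_probs, expert_probs) for H >= 1 experts.

    The softmax runs over H + 1 states; the extra dummy state has logit 0
    and always predicts y = 0, so it only appears in the denominator.
    """
    logits = [dot(w, x) for w in gate_w]
    shift = max(0.0, max(logits))  # subtract the largest logit for stability
    exps = [math.exp(l - shift) for l in logits]
    denom = math.exp(-shift) + sum(exps)
    gate = [e / denom for e in exps]
    expert = [sigmoid(dot(u, x)) for u in expert_u]
    p_y = sum(g * e for g, e in zip(gate, expert))
    return p_y, gate, expert


def log_loss(p, g):
    """Log-loss for a predicted probability p and boolean ground truth g."""
    return -g * math.log(p) - (1 - g) * math.log(1 - p)


def moe_gradients(x, g, gate_w, expert_u):
    """Gradients of the log-loss w.r.t. gating and expert weights."""
    p_y, gate, expert = moe_forward(x, gate_w, expert_u)
    common = (p_y - g) / (p_y * (1.0 - p_y))
    grad_w = [
        [gate[h] * (expert[h] - p_y) * common * xi for xi in x]
        for h in range(len(gate_w))
    ]
    grad_u = [
        [gate[h] * expert[h] * (1.0 - expert[h]) * common * xi for xi in x]
        for h in range(len(expert_u))
    ]
    return grad_w, grad_u
```

With a single expert and all gating weights at zero, the gate assigns probability 1/2 to the expert and 1/2 to the dummy state, which gives a quick sanity check of the forward pass; the gradients can be validated against finite differences before handing them to any quasi-Newton solver.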

Initialization: When $H$ (the number of mixtures) is greater than
one, we select $H$ positive examples according to the
non-deterministic KMeans++ sampling strategy [4]. The features of
these positive examples become the gating weights (the offset term is
set to zero). The expert weights are all initialized to zero. We then
run LBFGS until the relative change in the objective function ceases
to exceed 10“®. When $H = 1$, we initialize the expert weights to the
weights obtained by solving a logistic regression, while the gating
weights are all set to zero. Such an initialization ensures that the
likelihood of the trained MoE model is at least as high as the one
obtained from the logistic regression. In our experiments, we also
found consistent improvements by using MoE with 1 mixture compared to
a logistic regression, and small improvements by training an MoE with
(up to) 5 mixtures compared to a single mixture. Furthermore, for
multiple mixtures, we run several random restarts and pick the one
that has the best objective.

Hyperparameter Selection: In order to determine the best L2 weight
$\lambda_2$ on $\mathbf{w}_h$ and $\mathbf{u}_h$, we split the
training data into two equal, video-disjoint sets, grid search over
$\lambda$, and train a logistic regression with an L2 weight of
$\lambda$. For our experiments, we start with $\lambda$ = 10“^ and
increase it by a factor of 2 in each step. Once we find that the
holdout loss is starting to increase, we stop the search.
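The doubling search over the L2 weight can be sketched as a short loop. The function name, the `max_steps` safety cap and the callable `holdout_loss` (which would train a logistic regression with the given weight and return the loss on the held out half) are assumptions made for this sketch; the starting value is left as a parameter.

```python
def search_l2_weight(holdout_loss, start, factor=2.0, max_steps=50):
    """Multiply the L2 weight by `factor` until the holdout loss starts to rise.

    holdout_loss: callable mapping an L2 weight to the loss on the held out set.
    Returns the last weight before the loss increased.
    """
    best_lam = start
    best_loss = holdout_loss(start)
    lam = start
    for _ in range(max_steps):
        lam *= factor
        loss = holdout_loss(lam)
        if loss > best_loss:
            break  # holdout loss started to increase: stop the search
        best_lam, best_loss = lam, loss
    return best_lam
```

Stopping at the first increase assumes the holdout loss is roughly unimodal in the regularization weight, which keeps the number of training runs per label small.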

Training times: The total training time (from calibration to training
the MoE model) for the frame-level model is less than 8 hours using
all features on the 10.8 million training videos, with the load
distributed across 4000 machines. When the number of mixtures is
greater than one, the majority of the training time is spent doing
the random restarts and the hyper-parameter sweep. The corresponding
training time for the video-level model is anywhere between twelve
and sixteen hours on the 12 million set. Training the same models on
the sports videos takes less than an hour. Inference takes < 1 s per
4-minute video.

4. Video and Frame-Level Prediction LSTM 

We also tackle the task of frame-level and video-level 
prediction using recurrent neural networks. In this section 
we describe our approach. 

A recurrent network operates over a temporally ordered set of inputs
$x = \{x_1, \ldots, x_T\}$, where $x_t$ corresponds to the features
at time step $t$. For each time step $t$ the network computes a
hidden state $h_t$ which depends on $h_{t-1}$, the current input, and
bias terms. The output $y_t$ is computed as a function of the hidden
state at the current time step:

$$h_t = \mathcal{H}(W_x x_t + W_h h_{t-1} + b_h) \tag{9}$$

$$y_t = W_o h_t + b_o \tag{10}$$

where $W$ denotes the weight matrices: $W_x$ denotes the weight
matrix corresponding to the input features, $W_h$ denotes the weights
by which the previous hidden state is multiplied, and $W_o$ denotes
the weight matrix that is used to compute the output. $b_h$ and $b_o$
denote the hidden and output bias. $\mathcal{H}$ is the hidden state
activation function, and is typically chosen to be either the sigmoid
or the tanh function.

This type of formulation suffers from the vanishing gradient problem
[6]. Long Short-Term Memory neurons have been proposed by
Schmidhuber [12] as a type of neuron which does not suffer from this.
Thus, LSTM networks can learn longer term dependencies between
inputs, which is why we chose to use them for our purposes.

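The output of the hidden layer $h_t$ for LSTM is computed as follows:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{11}$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{12}$$

$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{13}$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{14}$$

$$h_t = o_t \tanh(c_t) \tag{15}$$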


where $\sigma$ is the sigmoid function. The main difference between
this formulation and the RNN is that $i_t$ decides whether to use the
input to update the state, $f_t$ decides whether to forget the state,
and $o_t$ decides whether to output. In some sense, this formulation
introduces data control flow driven by the state and input of the
network.

For the first time step, we set $c_0 = 0$ and $h_0 = 0$. However,
the initial states could also be represented by using a learned bias
term.
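A single LSTM time step following the gate equations above can be sketched in plain Python. The parameter dictionary and its key names are ours, and the cell-to-gate (peephole) weights are stored as vectors applied elementwise, which is a common convention that we assume here rather than something the text specifies.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with peephole connections.

    p maps names to parameters: "W_x*" and "W_h*" are matrices (lists of rows),
    "w_c*" are peephole vectors (assumed diagonal), "b_*" are bias vectors.
    Returns the new hidden state h and cell state c.
    """
    n = len(c_prev)

    def gate(wx, wh, wc, b, cell):
        from_x = matvec(p[wx], x)
        from_h = matvec(p[wh], h_prev)
        return [
            sigmoid(from_x[j] + from_h[j] + p[wc][j] * cell[j] + p[b][j])
            for j in range(n)
        ]

    i = gate("W_xi", "W_hi", "w_ci", "b_i", c_prev)  # input gate
    f = gate("W_xf", "W_hf", "w_cf", "b_f", c_prev)  # forget gate
    from_x = matvec(p["W_xc"], x)
    from_h = matvec(p["W_hc"], h_prev)
    c = [
        f[j] * c_prev[j] + i[j] * math.tanh(from_x[j] + from_h[j] + p["b_c"][j])
        for j in range(n)
    ]
    o = gate("W_xo", "W_ho", "w_co", "b_o", c)  # output gate looks at the new cell
    h = [o[j] * math.tanh(c[j]) for j in range(n)]
    return h, c
```

Calling `lstm_step` repeatedly, starting from all-zero `h_prev` and `c_prev`, unrolls the layer over a sequence; the returned pair is exactly the state that has to be carried from one time step to the next.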

For the purposes of both video and frame-level classification, we
consider a neural network which has frame-level classifications as
inputs. These scores are represented in a sparse vector
$s_t = \{ s_t^i \mid s_t^i > 0, \; s_t^i \in S_t \}$, where $S_t$ is
a vector containing the scores for all classes at time $t$. The first
layer of the network at time $t$ computes its output as
$x_t = \sum_{s_t^i \in s_t} s_t^i w_i + b$, where $b$ is the bias
term. This formulation of $s_t$ significantly reduces the amount of
computation needed for both the forward and backward pass because it
only considers those elements of $S_t$ which have values greater than
zero. For our networks, the number of non-zero elements per frame is
less than 1% of the total possible. In our experiments, each class is
represented internally (as $w_i$) with 512 weights.
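The sparse first layer can be sketched as a score-weighted sum of per-class weight vectors plus a bias, touching only the classes that fired. That reading of the layer is our assumption based on the description (one 512-weight vector per class, only positive scores considered); the function name and data layout are hypothetical.

```python
def sparse_input_layer(sparse_scores, class_embeddings, bias):
    """Compute bias + sum over fired classes of score * class weight vector.

    sparse_scores: dict class_id -> score, containing only classes that fired.
    class_embeddings: dict (or list) class_id -> weight vector of len(bias).
    The cost is proportional to the number of fired classes, not to the size
    of the full class vocabulary.
    """
    out = list(bias)
    for class_id, score in sparse_scores.items():
        if score <= 0:
            continue  # zero scores contribute nothing and are skipped
        weights = class_embeddings[class_id]
        for j in range(len(out)):
            out[j] += score * weights[j]
    return out


# Toy example with 2-dimensional vectors instead of 512.
toy_embeddings = {3: [1.0, 0.0], 7: [0.5, 2.0]}
toy_output = sparse_input_layer({3: 0.5, 7: 1.0}, toy_embeddings, [0.1, -0.1])
```

The backward pass has the same sparsity: only the weight vectors of the fired classes receive a gradient, which is what makes a 150,000-class input affordable.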


On top of this layer, we stack 5 LSTM layers with 512 
units each [11]. We unroll the LSTM layers for 30 time 
steps, which is equivalent to using 30 seconds’ worth of 
video at a time for training. Each of the top LSTM layers 
is further connected to a hierarchical softmax layer [21]. In 
this layer, we use a splitting factor of 10, and a random tree 
to approximate the hierarchy. 

Similarly to [22] we use a linearly increasing weight for each time
step, starting with 1/30 for the first frame and assigning a weight
of 1 to the last frame. This means the model is not penalized heavily
when trying to make a prediction with few frames. We also
investigated using a uniform weight for each frame and max-pooling
over the LSTM layer, but in our video-level metrics these methods
proved inferior to the linear weighting scheme.

In our dataset, videos have an average length of 240 seconds.
Therefore, when using the 30-frame unrolled LSTM model, it is not
immediately clear how to obtain a video-level prediction. In order to
process the entire video, we split it into 30-second chunks. Starting
with the first chunk of the video, we predict at every frame, and
save the state at the end of the sequence. When processing subsequent
chunks, we feed the previously saved state back into the LSTM. At the
end of the video we have as many sets of predictions as we have
frames. We investigated using max-pooling and average pooling over
the predictions and, as an alternative, taking the prediction at the
last frame.
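The chunked inference procedure can be sketched independently of the network itself. In the sketch below, `step_fn` stands in for one forward step of the unrolled model and is an assumption of ours, as are the function names; what comes from the text is the 30-frame chunking, the state carried across chunks, and the three ways of reducing per-frame predictions to one video-level prediction.

```python
def predict_video(frames, step_fn, initial_state, chunk_len=30):
    """Run a recurrent model over a whole video in fixed-size chunks.

    step_fn: callable (frame, state) -> (scores, new_state).
    The state saved at the end of one chunk is fed into the next chunk, so
    the result matches a single pass over the full sequence.
    Returns one score vector per frame.
    """
    state = initial_state
    predictions = []
    for start in range(0, len(frames), chunk_len):
        for frame in frames[start:start + chunk_len]:
            scores, state = step_fn(frame, state)
            predictions.append(scores)
    return predictions


def pool(predictions, mode):
    """Reduce per-frame score vectors to a single video-level score vector."""
    num_classes = len(predictions[0])
    if mode == "max":
        return [max(p[j] for p in predictions) for j in range(num_classes)]
    if mode == "average":
        return [sum(p[j] for p in predictions) / len(predictions) for j in range(num_classes)]
    if mode == "last":
        return list(predictions[-1])
    raise ValueError(f"unknown pooling mode: {mode}")
```

Because the state is threaded through every chunk, the last-frame prediction has seen the entire video, whereas max and average pooling also give weight to early frames where the model had less context.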

For every video, our LSTM model produces a 512- 
dimensional representation. This is the state of the top-most 
LSTM layer at the last frame in the video. This also allows 
other classifiers to be trained using this representation. 

The training was done using distributed stochastic gradient descent
[8] using 20 model replicas. We used a learning rate of 0.3, and
employed the AdaGrad update rule [10]. The training took less than 5
days before convergence. Inference takes < 1.4 seconds for the
average 4-minute video.

5. Experimental Setup 

Datasets: We created a new dataset of 12 million YouTube videos
spanning about 150,000 visual labels from Freebase [7]. We selected
these 12 million videos such that each of them has a view count of at
least 10,000. The 150,000 labels were selected by removing music
topics such as songs, albums and people (to remain within the visual
domain and to avoid having to concentrate on face recognition).
YouTube provides the labels of the videos, which are obtained by
running a Freebase-based annotator [27] on title, description and
other metadata sources. We retrieved the videos belonging to each
label by using the YouTube Topics API [1]. This annotation is fairly
reliable for high view videos, where the weighted precision is over
95% based on human evaluation. Many of the labels are however
extremely fine-grained, making them visually very similar or even
indistinguishable. Some examples are Super Mario 1, Super Mario 2,
FIFA World Cup 2014, FIFA World Cup 2015, African elephant, Asian
elephant, etc. These annotations are only available at the
video-level. Another dataset that we used is the Sports-1M dataset
[16], which consists of roughly 1.2 million YouTube sports videos
annotated with 487 classes. We will evaluate our models both at the
video level and the frame level.

Features: We extract two sets of sparse features from each frame
(sampled at 1 fps) for the videos in our training and test set. One
set of features is the prediction outputs of an Inception-derived
deep neural network [29] trained on YouTube thumbnail images. This
model by itself performs much worse on our training set, because
YouTube thumbnails are noisy (they tend to be chosen for visual
appeal rather than to describe the concept in the video) and each is
only a single frame from the entire YouTube video. The number of
unique sparse features firing from this model on our 10 million
training set is about 110,000. In our experiments section, we will
abbreviate these features as TM, which stands for thumbnail model.
Another set of features is the predictions of a deep neural network
with a similar architecture to the TM model, but largely trained on
Flickr data. The target labels are again from the metadata of the
Flickr photos and are similar in spirit to ImageNet labels [18].
Moreover, they are much less fine-grained than the YouTube labels.
The vocabulary size of these labels is about 17,000. For example, the
label Grand Canyon won’t be present. Instead the label Canyon will be
present. We will abbreviate these features as IM, which stands for
image models.

For both models we process the images by first resizing
them to 256 × 256 pixels, then randomly sampling a
220 × 220 region and randomly flipping the image horizontally
with 50% probability when training. Similarly to the
LSTM model, the training was performed on a cluster using 
Downpour Stochastic Gradient Descent [9] with a learning 
rate of 10“^ in conjunction with a momentum of 0.9 and 
weight decay of 0.0005. 
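
The training-time augmentation described above (random 220 × 220 crop followed by a horizontal flip with 50% probability) can be sketched as follows. The sketch assumes the image has already been resized to 256 × 256 and is stored as a height × width × channels array; the function name `random_crop_and_flip` is hypothetical.

```python
import numpy as np

def random_crop_and_flip(image, crop_size=220, rng=None):
    """Randomly crop a crop_size x crop_size region and flip it horizontally
    with probability 0.5. `image` is an [H, W, C] array, assumed to have been
    resized to 256 x 256 beforehand."""
    if rng is None:
        rng = np.random.default_rng()
    height, width = image.shape[:2]
    if height < crop_size or width < crop_size:
        raise ValueError("image is smaller than the requested crop")
    # Pick the top-left corner uniformly among all valid positions.
    top = int(rng.integers(0, height - crop_size + 1))
    left = int(rng.integers(0, width - crop_size + 1))
    crop = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]  # reverse the width axis: horizontal flip
    return crop

image = np.arange(256 * 256 * 3).reshape(256, 256, 3)
augmented = random_crop_and_flip(image, rng=np.random.default_rng(0))
```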

5.1. Training and Evaluation 

We partition the training data using a 90/10 split. This 
results in about 10.8 million videos in the training partition 
and 1.2M videos in the test partition. The ground-truth is 
only available at video-level. We train two kinds of models: 

Frame level models: These models are trained to predict
the label from a single frame. To provide robustness,
we feed MiCRObE features aggregated over more than one
frame: the features for each frame are obtained by averaging
the features in a ±2 second window. For training the
max-calibration model, we use all the frames from each video
in our training set and assume that every frame is associated
with the video-level ground-truth labels. For mining the
collection of hard negatives and good positives for the second
stage model, we randomly sample 10 frames from each video and
mine the top 100,000 scoring examples for each label from the
resulting 108 million frames (where the scoring is done using
the max-calibration model). At inference time, we annotate each
frame (sampled at 1 fps) in the video using the trained
MiCRObE cascade model. The output of the LSTM model is
evaluated at every frame, while passing the state forward.

Figure 2: The ROC for frame-level predictions for two models
using human ratings for ground truth.
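
The ±2 second averaging of frame features can be sketched as below. With frames sampled at 1 fps, a ±2 second window covers up to five consecutive frames. How the window is handled at the start and end of a video is not specified in the text; truncating it at the boundaries is an assumption of this sketch, as is the function name `window_average`.

```python
import numpy as np

def window_average(frame_features, radius=2):
    """Average features over a +/- `radius` frame window around each frame.

    frame_features: [num_frames, num_features] array sampled at 1 fps, so
    radius=2 corresponds to a +/- 2 second window. The window is truncated
    at the video boundaries (an assumption of this sketch).
    """
    num_frames = frame_features.shape[0]
    smoothed = np.empty_like(frame_features, dtype=float)
    for t in range(num_frames):
        lo = max(0, t - radius)
        hi = min(num_frames, t + radius + 1)
        smoothed[t] = frame_features[lo:hi].mean(axis=0)
    return smoothed

features = np.arange(6, dtype=float).reshape(6, 1)  # one feature, six frames
smoothed = window_average(features)
```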

Since we don’t have frame level ground truth at such
a large scale, we either (a) convert the frame-level labels
to video-level labels using max-aggregation of the
frame-level probabilities and evaluate against the video-level
ground truth (see Table 1), or (b) send a random sample
of frames from a random sample of output labels to human
raters (Figure 2).
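
Option (a) can be sketched as follows: frame-level probabilities are max-aggregated per label, and the top-scoring label is checked against the video-level ground truth. The sketch assumes that hit@1 counts a video as correct when its highest-scoring label is among its ground-truth labels; the helper names and the toy data are illustrative only.

```python
import numpy as np

def max_aggregate(frame_probs):
    """Video-level score per label = maximum frame-level probability."""
    return frame_probs.max(axis=0)

def hit_at_1(video_scores, ground_truth_labels):
    """True if the top-scoring label index is in the ground-truth set."""
    return int(np.argmax(video_scores)) in ground_truth_labels

# Two toy videos: ([num_frames, num_labels] probabilities, ground-truth label ids).
videos = [
    (np.array([[0.1, 0.7, 0.2], [0.3, 0.2, 0.9]]), {2}),
    (np.array([[0.6, 0.1, 0.3], [0.5, 0.2, 0.4]]), {1}),
]
hits = [hit_at_1(max_aggregate(probs), truth) for probs, truth in videos]
hit_rate = sum(hits) / len(hits)
```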

Note that the predictions of the underlying base models
are entities which have some overlap with the target
vocabulary. The precision numbers in the Self column are the
accuracy of the base classifiers by themselves. For the
combined model IM+TM, we take the maximum score of an
entity coming from either of the models (Table 1).
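
The IM+TM combination rule used for the Self column can be illustrated with a short sketch. Scores are assumed to be stored as sparse entity-to-score dictionaries; the function name `combine_max` and the example entities are hypothetical.

```python
def combine_max(im_scores: dict, tm_scores: dict) -> dict:
    """Merge two sparse {entity: score} maps, keeping the maximum score for
    entities predicted by both models."""
    combined = dict(im_scores)
    for entity, score in tm_scores.items():
        combined[entity] = max(score, combined.get(entity, score))
    return combined

im = {"canyon": 0.6, "hiking": 0.3}
tm = {"hiking": 0.8, "grand canyon": 0.5}
combined = combine_max(im, tm)
```

Entities predicted by only one model keep their score unchanged, so the combined vocabulary is the union of the two base vocabularies.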

In order to prepare the data for human rating, we took
a random set of 6,000 videos which did not appear in the
training set. For each video, we computed the output
probabilities for all labels. For those labels which had an output
probability greater than 0.1, we took all the frames which
passed the thresholding, sorted the scores, and split the
entire score range into 25 equally sized buckets. From each
bucket, we randomly sampled a frame and a score. For each
model we evaluated, we randomly sampled 250 labels (with
25 frames each), and sent this set to human raters. The total
number of labels from which we sampled these 250 was
3,541 for MiCRObE, and 1,568 for the LSTM model. The
resulting ROC is depicted in Figure 2. We only considered
frames for which there was a quorum (at least 2 raters had
to agree).
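
The per-label frame sampling can be sketched as below. The text does not say whether "equally sized" refers to equal score width or equal frame count; this sketch assumes equal-width score intervals. The function name `sample_per_bucket`, the fixed seed, and the synthetic scores are illustrative assumptions.

```python
import random

def sample_per_bucket(scored_frames, threshold=0.1, num_buckets=25, seed=0):
    """Pick one random (frame_id, score) pair from each score bucket.

    scored_frames: list of (frame_id, score) for a single label. Frames at or
    below `threshold` are dropped; the remaining score range is split into
    `num_buckets` equal-width intervals (an assumption of this sketch).
    """
    rng = random.Random(seed)
    kept = [(f, s) for f, s in scored_frames if s > threshold]
    if not kept:
        return []
    lo = min(s for _, s in kept)
    hi = max(s for _, s in kept)
    width = (hi - lo) / num_buckets or 1.0  # guard against a zero-width range
    buckets = [[] for _ in range(num_buckets)]
    for frame_id, score in kept:
        index = min(int((score - lo) / width), num_buckets - 1)
        buckets[index].append((frame_id, score))
    # Empty buckets contribute nothing.
    return [rng.choice(bucket) for bucket in buckets if bucket]

scored = [(i, i / 1000) for i in range(1001)]
sample = sample_per_bucket(scored)
```

Sampling one frame per bucket spreads the rated frames across the full score range above the threshold, which is what an ROC curve needs.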

The MiCRObE method is well suited for frame-level
classification because during the learning process




Features  Dataset    Self    MaxCal  Logit         Logit       MiCRObE (1 mix)  MiCRObE (5 mix)
                                     Random Negs.  Hard Negs.  Hard Negs.       Hard Negs.
IM        YT-12M     4.0%    20.0%   27.0%         29.2%       31.3%            32.4%
TM        YT-12M     19.0%   28.0%   31.0%         39.8%       40.6%            41.0%
IM+TM     YT-12M     7.0%    33.0%   40.6%         42.5%       43.9%            43.8%
IM        Sports-1M  0.9%    -       25.6%         35.0%       39.3%            39.8%
TM        Sports-1M  1.2%    -       33.9%         45.7%       46.8%            49.9%
IM+TM     Sports-1M  1.5%    39.0%   41.0%         47.8%       49.8%            50.2%

Table 1: Frame level model evaluation against the video-level ground truth. The values in the table represent hit@1.


Features  Benchmark  MaxCal  Logit       MiCRObE (1 mix)  MiCRObE (5 mix)  LSTM
                             Hard Negs.  Hard Negs.       Hard Negs.
IM        YT-12M     20.0%   -           36.2%            36.6%            44.4%
TM        YT-12M     28.0%   -           47.3%            47.3%            45.7%
IM+TM     YT-12M     29.0%   49.3%       50.1%            49.5%            52.3%
IM        Sports-1M  28.2%   45.0%       46.5%            47.2%            52.8%
TM        Sports-1M  38.6%   54.5%       55.4%            56.0%            58.8%
IM+TM     Sports-1M  40.3%   54.7%       56.8%            57.0%            59.0%

Table 2: Hit@1 for the video level models against the ground truth.


it actively uses frame-level information and mines hard
examples. As such, it provides better performance than the
LSTM method on this task (Figure 2).

Video level models: These models are trained to predict
the labels directly from the aggregated features from
the video. The sparse features (available at each frame)
are aggregated at the video level and the fusion models
are directly trained to predict video-level labels from the
(early) aggregated features. For this early feature aggregation,
we collect feature-specific statistics like the mean and top-k
(for k = 1, 2, 3, 4, 5) of each feature over the entire video.
For example, the label “soccer” from the TM model will
expand to six different features: TM:Soccer:Mean (which
is the average score of this feature over the entire video),
TM:Soccer:1 (which is the highest score of this feature over
the video), TM:Soccer:2 (which is the second highest score
of this feature) and so on. The LSTM model remains
unchanged from the frame-level prediction task. The video-level
label is obtained by averaging the frame-level scores.
The results are summarized in Table 2.
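
The expansion of one sparse feature into its six video-level statistics can be sketched as follows. The sketch assumes a score is supplied for every sampled frame and that missing top-k entries (for videos with fewer than five frames) are padded with zero; both choices, and the function name `aggregate_feature`, are assumptions made for illustration.

```python
import numpy as np

def aggregate_feature(name, frame_scores, max_k=5):
    """Expand one per-frame feature into Mean and top-1..top-max_k features."""
    scores = np.sort(np.asarray(frame_scores, dtype=float))[::-1]  # descending
    if scores.size == 0:
        raise ValueError("frame_scores must not be empty")
    features = {f"{name}:Mean": float(scores.mean())}
    for k in range(1, max_k + 1):
        # k-th highest score over the video; zero-padded if the video is short.
        features[f"{name}:{k}"] = float(scores[k - 1]) if k <= scores.size else 0.0
    return features

soccer = aggregate_feature("TM:Soccer", [0.2, 0.9, 0.4, 0.7, 0.1, 0.6])
```

Keeping several top-k values in addition to the mean lets the fusion model distinguish a label that fires strongly in a few frames from one that fires weakly throughout the video.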

On the Sports-1M benchmark, which is video-level, the
LSTM method yields 59.0% hit@1. Karpathy et al. [16]
report 60.9%, while Tran et al. [31] report 61.1% using a
single network which was trained specifically for the task,
starting from pixels. Similarly, Ng et al. [22] report 72.1%.
However, to obtain a single prediction for a video, their
approach passes the video through the network 240 times, which
would not be possible in our scenario, since we are concerned
about both learning and inference speed.

In terms of video classification, MiCRObE was adapted
to use feature aggregation and it provides comparable
performance to the LSTM model (a hit@1 within 2.8% on
YT-12M, and within 3% on Sports-1M). The LSTM model,
unlike MiCRObE, learns a representation of the input labels,
and makes use of the sequential nature of the data.

Compared to previous work concentrating on large-scale video
classification, our methods do not require training CNNs on
the raw video data, which is desirable when working with
large numbers of videos. Our best performing single-pass
video-level model is within 2.1% hit@1 of the best published
model which does not need multiple model evaluations
per frame [31] (trained directly from frames).

6. Conclusion 

We studied the problem of efficient large scale video
classification (12 million videos) with a large label space
(150,000 labels). We proposed to use image-based classifiers
which have been trained either on video thumbnails
or on Flickr images in order to represent the video frames,
thereby avoiding a costly pre-training step on video frames.
We demonstrate that we can organically discover the correlated
features for a label using the max-calibration model.
This allows us to bypass the curse of dimensionality by
providing a small set of features for each label. We provided a
novel technique for hard negative mining using an underlying
max-calibration model and use it to train a second
order mixture of experts model. MiCRObE can be used
as a frame-level classification method that does not require
human-selected, frame-level ground truth. This is crucial
when attempting to classify into a large space of labels.
MiCRObE shows substantial improvements in precision of
the learnt fusion model against other simpler baselines like
max-calibration and models trained using random negatives
and provides the highest level of performance at the task
of frame-level classification. We also show how to adapt
this model for video classification. Finally, we provide an
LSTM based model that is capable of the highest video-level
performance on YT-12M. Performance could further
be improved by late-fusing outputs of the two algorithms.
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